Load Data:
library(readr)
Untreated <- readRDS(paste0(wd,"/data/NCI_TPW_gep_untreated.rds"))
Treated <- readRDS(paste0(wd,"/data/NCI_TPW_gep_treated.rds"))
Filter Vorinostat data:
In the specific analysis, we have already filtered out the biomarkers for treatment with Vorinostat and carried out first analyses with them. For example, we have investigated whether the biomarkers also include the targets of vorinostat indicated in the drug annotation. However, this was not the case. So it is of interest to investigate which functions our biomarkers have and in which pathways they are involved.
First, the functions of the biomarkers will be selected by hand, stored in a table and then examined. Next, we will present and apply some of the numerous enrichment methods. For this purpose we will use the R package “cluster profiler”. Here we will perform and compare enrichment analyses on “Gene Ontology”, “Reactome Pathway” and “KEGG (Kyoto Encyclopedia of Genes and Genomes) Pathway”. Finally, we will briefly discuss Gene Set Variation Analysis.
1. Functional Analysis “by hand”
2. Functional Analysis using “cluster profiler”
3. Gene Set Variation Analysis
We used human gene-databases such as “GeneCards” to do research about our biomarkers. We collected information about the gene-function as well as about the tissue in which the respective genes are mainly expressed. Unfortunately, we were not able to assign a function and tissue to each of the biomarkers since some genes were not sufficiently researched yet. That lead to quite a lot NA-values in our table. On the other hand, we found great numbers of different functions for single genes. We tried to summarize the functions but were not able to consider them all. Consequently, our table is only suitable for the purpose to get a general overview and to investigate, whether we have biomarkers with partly similar features.
We used Excel to create a table with our top 50 biomarkers and their features, a part of the table can be seen below. Then we imported the annotation data in R and definded the tissue- and function-column for better readability:
vorinostat_annotation = read.csv2(paste0(wd,"/data/Biomarkers.csv"),header = TRUE, quote="\"")
tissue = vorinostat_annotation$affected.Tissue..if.specific.
general.function = vorinostat_annotation$general.Function
# rename
names(vorinostat_annotation)[names(vorinostat_annotation)=="X...Gene"] <- "Gene"
names(vorinostat_annotation)[names(vorinostat_annotation)=="affected.Tissue..if.specific."] <- "affected Tissue (if specific)"
knitr::kable(vorinostat_annotation, caption = "vorinostat annotation")
| Gene | Function | affected Tissue (if specific) | general.Function |
|---|---|---|---|
| FAM57A | amino acid transport and glutathione metabolism | lung | transport |
| HIST1H2BD | histone cluster | NA | chromatin binding |
| PMF1 | chromosome segregation | NA | DNA replication |
| JADE2 | histone acetyltransferase activity | NA | chromatin binding |
| LMNB1 | lamin protein | brain | cytoskeleton |
| AQP3 | water channel protein | NA | ion regulator |
| KIAA0368 | proteasome adaptor and scaffold | NA | proteinbiosynthesis |
| ENPP2 | ectonucleotide pyrophosphatase/phosphodiesterase | germ cells | phosphorylation |
| SYT11 | calcium sensor | brain | neurotransmitter regulation |
| HIST1H3F | histone cluster | brain | chromatin binding |
| VWA5A | von Willebrand Factor | blood | coagulation |
| LOC728392 | NA | NA | NA |
| TRIP6 | hormone receptor | NA | NA |
| RUNDC3B | NA | NA | NA |
| TMEM35B, ZMYM6 | transmembrane protein | NA | membrane |
| MCM5 | initiation of DNA replication | skin | chromatin binding |
| ZMYND11 | transcriptional repressor | brain | regulation of transcritption |
| NFKB1 | transcription factor | immue system | regulation of transcritption |
| PSMB10 | proteasome | NA | proteinbiosynthesis |
| HIST2H2BE | histone cluster | immue system | chromatin binding |
| AVEN | apoptosis and caspase activation inhibitor | brain | cell growth |
| PHF11 | PHD Finger Protein | immue system | chromatin binding |
| KIF5C | transport processes | brain | transport |
| CORO1A | cell cycle progression, signal transduction, apoptosis, and gene regulation | immue system | cell growth |
| MPDU1 | Mannose-P-Dolichol utilization | brain | metabolism |
| FOS | transcription factor subunit | bones | regulation of transcritption |
| DHRS2 | dehydrogenase/reductase enzyme | NA | NA |
| ABAT | GABA reduction | brain | neurotransmitter regulation |
| SERPINI1 | synaptic plasticity | brain | neurotransmitter regulation |
| MIR612///NEAT1 | transcription regulator of oncogenes | NA | regulation of transcritption |
| HBA2///HBA1 | globin for hemoglobin | blood | oxygen transport |
| CLU | chaperone | brain | proteinbiosynthesis |
| STC1 | ion transport | kidney | ion regulator |
| NMI | interacts with oncogenes | blood | regulation of transcritption |
| AREG | growth factor | brain | cell growth |
| CRISPLD2 | heparin binding and glycosaminoglycan binding | immue system | NA |
| NSMAF | binds phospholipids | NA | NA |
| HIST1H1C | histone cluster | NA | chromatin binding |
| SERPINH1 | serine protease inhibitor in ER | NA | NA |
| HIST1H2BJ///HIST1H2BG | histone cluster | NA | chromatin binding |
| SLC17A7 | ion transport | brain | ion regulator |
| TYMS | DNA-replication + repair | NA | DNA replication |
| ASMTL | o-methyltransferase | NA | NA |
| TNFSF9 | TNF, cytokine | blood | infammation |
| FAIM | apoptose protection, regulates B-cell signals + differentiation | blood | cell growth |
| ADI1 | downregulation of cell migration, HepC-Virus replication | NA | cell migration |
| TUBB2B | Tubulin (microtubuli-component), neuronal migration | brain | cytoskeleton |
| HMGA2 | transcription regulator, chromosome-condensation | NA | regulation of transcritption |
| RGL1 | cytoskeleton remodeling | NA | cytoskeleton |
| NEU1 | neuraminidase | NA | NA |
| HIST1H2BD | histone cluster | NA | chromatin binding |
The following barplot shows the prevalence of tissues, in which 50 of our biomarkers are preferentially expressed.
col=palette(rainbow(9))
table(tissue)
## tissue
## blood bones brain brain germ cells
## 5 1 12 1 1
## immue system kidney lung skin
## 5 1 1 1
barplot(table(tissue), ylab="counts", main="Affected tissues by biomarkers", col=col, las=3, cex.names = 0.7)
Vorinostat is used against T-cell lymphomas. They origin from T-cells which are a subgroup of white blood cells and of high importance for the immune defense. Thus, we expected Vorinostat to affect preferentially those areas. Some of our biomarkers are indeed mainly active in blood cells and in the immune system but we also found, that quite a lot of the biomarkers show activity in the brain.
col=palette(rainbow(16))
table(general.function)
## general.function
## DNA replication cell growth
## 2 4
## cell migration chromatin binding
## 1 9
## coagulation cytoskeleton
## 1 3
## infammation ion regulator
## 1 3
## membrane metabolism
## 1 1
## neurotransmitter regulation oxygen transport
## 3 1
## phosphorylation proteinbiosynthesis
## 1 3
## regulation of transcritption transport
## 6 2
barplot(table(general.function), ylab="counts", main="General functions of biomarkers", col=col, las=3, cex.names = 0.7)
In order to get a better overview of the function, we can plot only those functions which occur with a frequency greater or equal to two:
highfrequencyfunctions <- table(general.function)[table(general.function)>2]
col=palette(rainbow(7))
barplot(highfrequencyfunctions, ylab="counts", main="Prevalent functions of biomarkers", col=col, las=3, cex.names = 0.7)
The plot indicates, that Vorinostat affects amongst other things especially genes which are involved in chromatin binding and transcription regulation. We already knew, that the drugs working mechanism is to create an altered chromatin structure and gene accessibility. However, we did not know, that those altered structures have a high impact on genes involved in chromatin-organisation as well.
In an other column in the self created annotation table we specified the function of some biomarkers more precise. Some of the chromatin-binding biomarkers function as histone components.
length(grep(vorinostat_annotation$Function, pattern = "histone cluster"))
## [1] 6
In 50 researched biomarkers we found six histone genes which show a clearly altered expression after Vorinostat treatment. That leads to the assumption, that beside Vorinistats working mechanism as a histone modification regulator it might affect the number or composition of histones as well.
The R package “cluster profiler”, created by Guangchuang Yu et al. allows the user to create and visualize different functional profiles of genes and gene clusters using different enrichment methods. Here, we will focus on “Gene Ontology”, “Reactome Pathway” and “KEGG Pathway” methodes.
Gene Ontology is an enrichment Analysis, which hierarchically classifies genes or gene products to terms organized in an “Gene Ontology” system. The genes can be identified and ordered in three different ways (Yon Rhee et al., 2008):
In addition, a statistical test can be carried out on the assignment, which outputs a p-value. This value then indicates how significant the assignment has been. The smaller the value, the better the gene could be assigned to a group. For example, almost all genes can be assigned to a biological process. This assignment would therefore not be significant. In contrast, there are only a few genes that belong to the group “DNA repair”. The smaller the pValue of the assigned gene, the larger the association with this group. (http://geneontology.org/docs/go-enrichment-analysis/)
First of all, we need to load the libraries, we will work with:
library(clusterProfiler)
library(org.Hs.eg.db)
To be able to work with our biomarker genes in cluster profiler, we need to translate them into another Gene ID:
translated.genes= bitr(generalbiomarkergenes,
fromType="SYMBOL",
toType="ENTREZID",
OrgDb = org.Hs.eg.db)
head(translated.genes)
## SYMBOL ENTREZID
## 1 DHRS2 10202
## 2 ABAT 18
## 3 SERPINI1 5274
## 6 CLU 1191
## 7 STC1 6781
## 8 NMI 9111
As a first step, we will assign the biomarkers to the groups described above. First we define the “input genes” with with the correct Gene ID:
# load the needed library
library(DOSE)
# take the needed gene ID
gene=translated.genes$ENTREZID
head(gene)
## [1] "10202" "18" "5274" "1191" "6781" "9111"
Now we can group them, using all 3 categories:
## 1. Biological Process
ggo.bp <- groupGO(gene= gene, OrgDb = org.Hs.eg.db, ont = "BP", level = 3, readable = FALSE)
head(summary(ggo.bp))
## ID Description Count
## GO:0019953 GO:0019953 sexual reproduction 4
## GO:0019954 GO:0019954 asexual reproduction 0
## GO:0022414 GO:0022414 reproductive process 9
## GO:0032504 GO:0032504 multicellular organism reproduction 4
## GO:0032505 GO:0032505 reproduction of a single-celled organism 0
## GO:0061887 GO:0061887 reproduction of symbiont in host 0
## GeneRatio geneID
## GO:0019953 4/90 18/8061/330/8900
## GO:0019954 0/90
## GO:0022414 9/90 18/6781/374/8644/8061/2353/330/10370/8900
## GO:0032504 4/90 6781/8061/330/8900
## GO:0032505 0/90
## GO:0061887 0/90
## 2. Molecular Function
ggo.mf <- groupGO(gene= gene, OrgDb = org.Hs.eg.db, ont = "MF", level = 3, readable = FALSE)
head(summary(ggo.mf))
## ID
## GO:0001070 GO:0001070
## GO:0001072 GO:0001072
## GO:0001073 GO:0001073
## GO:0001134 GO:0001134
## GO:0003700 GO:0003700
## GO:0003711 GO:0003711
## Description
## GO:0001070 RNA-binding transcription regulator activity
## GO:0001072 transcription antitermination factor activity, RNA binding
## GO:0001073 transcription antitermination factor activity, DNA binding
## GO:0001134 transcription regulator recruiting activity
## GO:0003700 DNA-binding transcription factor activity
## GO:0003711 transcription elongation regulator activity
## Count GeneRatio
## GO:0001070 0 0/90
## GO:0001072 0 0/90
## GO:0001073 0 0/90
## GO:0001134 0 0/90
## GO:0003700 10 10/90
## GO:0003711 0 0/90
## geneID
## GO:0001070
## GO:0001072
## GO:0001073
## GO:0001134
## GO:0003700 8091/8061/2353/4790/10370/2969/2305/467/7157/23635
## GO:0003711
## 3. Cellular Component
ggo.cc <- groupGO(gene= gene, OrgDb = org.Hs.eg.db, ont = "CC", level = 3, readable = FALSE)
head(summary(ggo.cc))
## ID Description Count GeneRatio
## GO:0005886 GO:0005886 plasma membrane 26 26/90
## GO:0005628 GO:0005628 prospore membrane 0 0/90
## GO:0005789 GO:0005789 endoplasmic reticulum membrane 4 4/90
## GO:0019867 GO:0019867 outer membrane 1 1/90
## GO:0031090 GO:0031090 organelle membrane 17 17/90
## GO:0034357 GO:0034357 photosynthetic membrane 0 0/90
## geneID
## GO:0005886 6781/57030/8744/55256/4758/79850/360/1490/5168/23208/7205/11151/8061/6515/80328/4900/10486/8829/1368/55915/6253/306/27131/27340/2014/4804
## GO:0005628
## GO:0005789 374/9526/6253/2281
## GO:0019867 637
## GO:0031090 1191/374/57030/7298/4758/4001/23208/11151/51131/6515/4900/64921/57134/306/27131/637/2281
## GO:0034357
The result of the assignment can be visualized via a bar plot.
# visualization
par(mar=c(5, 4, 5, 9))
par(mfrow=c(1,3))
barplot(ggo.bp, drop=TRUE, showCategory=12)
barplot(ggo.mf, drop=TRUE, showCategory=12)
barplot(ggo.cc, drop=TRUE, showCategory=12)
We see that we have assignments in all Barplots in which almost all belong. In biological process, for example, we have an excessive number of genes involved in the “nitrogen compound metabolic process”, in molecular function we have a large number with “hydrolase activity” and all genes localised in the “cell part”. Consequently, these bar plots are not very informative. For this reason we will now additionally check the statistical parameters by an enrichment analysis to see how significant our allocations are.
Here we also need to translate the Gene IDs first:
gene.df <- bitr(generalbiomarkergenes, fromType = "SYMBOL",
toType = c("ENSEMBL"),
OrgDb = org.Hs.eg.db)
For the enrichment analysis we have to set some important statistical parameters. As we expect to have small pValues we set the significance level to 0.1. Because we have multiple compairisons, we also have to use the Adjust p.Value Method. Here we use the Benjamini-Hochberg Method, which controls the false discovery rate (FDR) and guarantees a powerful test. As limit for our q-value we have chosen 0.05, which is used by default.
# zum nachlesen
#http://www.nonlinear.com/support/progenesis/comet/faq/v2.0/pq-values.aspx
## 1. Biological Process
ego.bp <- enrichGO(gene = gene.df$ENSEMBL,
OrgDb = org.Hs.eg.db,
keyType = 'ENSEMBL',
ont = "BP",
pAdjustMethod = "BH",
pvalueCutoff = 0.01,
qvalueCutoff = 0.05)
head(summary(ego.bp))
## ID Description GeneRatio
## GO:0006689 GO:0006689 ganglioside catabolic process 8/95
## GO:0009313 GO:0009313 oligosaccharide catabolic process 8/95
## GO:0046479 GO:0046479 glycosphingolipid catabolic process 8/95
## GO:0019377 GO:0019377 glycolipid catabolic process 8/95
## GO:0046514 GO:0046514 ceramide catabolic process 8/95
## GO:0001573 GO:0001573 ganglioside metabolic process 8/95
## BgRatio pvalue p.adjust qvalue
## GO:0006689 14/20505 4.609982e-16 1.005437e-12 8.778376e-13
## GO:0009313 22/20505 4.762652e-14 3.462448e-11 3.023031e-11
## GO:0046479 22/20505 4.762652e-14 3.462448e-11 3.023031e-11
## GO:0019377 24/20505 1.087160e-13 5.927741e-11 5.175455e-11
## GO:0046514 25/20505 1.592733e-13 6.947503e-11 6.065799e-11
## GO:0001573 30/20505 8.457689e-13 3.074370e-10 2.684204e-10
## geneID
## GO:0006689 ENSG00000204386/ENSG00000227315/ENSG00000234343/ENSG00000234846/ENSG00000227129/ENSG00000184494/ENSG00000228691/ENSG00000223957
## GO:0009313 ENSG00000204386/ENSG00000227315/ENSG00000234343/ENSG00000234846/ENSG00000227129/ENSG00000184494/ENSG00000228691/ENSG00000223957
## GO:0046479 ENSG00000204386/ENSG00000227315/ENSG00000234343/ENSG00000234846/ENSG00000227129/ENSG00000184494/ENSG00000228691/ENSG00000223957
## GO:0019377 ENSG00000204386/ENSG00000227315/ENSG00000234343/ENSG00000234846/ENSG00000227129/ENSG00000184494/ENSG00000228691/ENSG00000223957
## GO:0046514 ENSG00000204386/ENSG00000227315/ENSG00000234343/ENSG00000234846/ENSG00000227129/ENSG00000184494/ENSG00000228691/ENSG00000223957
## GO:0001573 ENSG00000204386/ENSG00000227315/ENSG00000234343/ENSG00000234846/ENSG00000227129/ENSG00000184494/ENSG00000228691/ENSG00000223957
## Count
## GO:0006689 8
## GO:0009313 8
## GO:0046479 8
## GO:0019377 8
## GO:0046514 8
## GO:0001573 8
## 2. Molecular Function
ego.mf <- enrichGO(gene = gene.df$ENSEMBL,
OrgDb = org.Hs.eg.db,
keyType = 'ENSEMBL',
ont = "MF",
pAdjustMethod = "BH",
pvalueCutoff = 0.01,
qvalueCutoff = 0.05)
head(summary(ego.mf))
## ID Description
## GO:0052794 GO:0052794 exo-alpha-(2->3)-sialidase activity
## GO:0052795 GO:0052795 exo-alpha-(2->6)-sialidase activity
## GO:0052796 GO:0052796 exo-alpha-(2->8)-sialidase activity
## GO:0004308 GO:0004308 exo-alpha-sialidase activity
## GO:0016997 GO:0016997 alpha-sialidase activity
## GO:0004553 GO:0004553 hydrolase activity, hydrolyzing O-glycosyl compounds
## GeneRatio BgRatio pvalue p.adjust qvalue
## GO:0052794 8/97 12/19626 1.291220e-16 1.480599e-14 1.377301e-14
## GO:0052795 8/97 12/19626 1.291220e-16 1.480599e-14 1.377301e-14
## GO:0052796 8/97 12/19626 1.291220e-16 1.480599e-14 1.377301e-14
## GO:0004308 8/97 14/19626 7.770352e-16 5.346002e-14 4.973025e-14
## GO:0016997 8/97 14/19626 7.770352e-16 5.346002e-14 4.973025e-14
## GO:0004553 9/97 105/19626 2.450757e-09 1.405101e-07 1.307071e-07
## geneID
## GO:0052794 ENSG00000204386/ENSG00000227315/ENSG00000234343/ENSG00000234846/ENSG00000227129/ENSG00000184494/ENSG00000228691/ENSG00000223957
## GO:0052795 ENSG00000204386/ENSG00000227315/ENSG00000234343/ENSG00000234846/ENSG00000227129/ENSG00000184494/ENSG00000228691/ENSG00000223957
## GO:0052796 ENSG00000204386/ENSG00000227315/ENSG00000234343/ENSG00000234846/ENSG00000227129/ENSG00000184494/ENSG00000228691/ENSG00000223957
## GO:0004308 ENSG00000204386/ENSG00000227315/ENSG00000234343/ENSG00000234846/ENSG00000227129/ENSG00000184494/ENSG00000228691/ENSG00000223957
## GO:0016997 ENSG00000204386/ENSG00000227315/ENSG00000234343/ENSG00000234846/ENSG00000227129/ENSG00000184494/ENSG00000228691/ENSG00000223957
## GO:0004553 ENSG00000204386/ENSG00000227315/ENSG00000234343/ENSG00000234846/ENSG00000227129/ENSG00000184494/ENSG00000228691/ENSG00000223957/ENSG00000117643
## Count
## GO:0052794 8
## GO:0052795 8
## GO:0052796 8
## GO:0004308 8
## GO:0016997 8
## GO:0004553 9
## 3. Cellular Component
ego.cc <- enrichGO(gene = gene.df$ENSEMBL,
OrgDb = org.Hs.eg.db,
keyType = 'ENSEMBL',
ont = "CC",
pAdjustMethod = "BH",
pvalueCutoff = 0.01,
qvalueCutoff = 0.05)
head(summary(ego.cc))
## ID Description GeneRatio BgRatio
## GO:0035580 GO:0035580 specific granule lumen 9/99 83/21794
## GO:0042581 GO:0042581 specific granule 11/99 213/21794
## GO:0034774 GO:0034774 secretory granule lumen 13/99 374/21794
## GO:0060205 GO:0060205 cytoplasmic vesicle lumen 13/99 392/21794
## GO:0031983 GO:0031983 vesicle lumen 13/99 393/21794
## GO:0043202 GO:0043202 lysosomal lumen 8/99 111/21794
## pvalue p.adjust qvalue
## GO:0035580 1.415185e-10 3.240774e-08 2.785680e-08
## GO:0042581 3.565349e-09 4.082325e-07 3.509054e-07
## GO:0034774 1.486233e-08 1.134491e-06 9.751775e-07
## GO:0060205 2.586890e-08 1.220798e-06 1.049364e-06
## GO:0031983 2.665498e-08 1.220798e-06 1.049364e-06
## GO:0043202 4.087960e-08 1.560238e-06 1.341138e-06
## geneID
## GO:0035580 ENSG00000204386/ENSG00000227315/ENSG00000234343/ENSG00000234846/ENSG00000227129/ENSG00000184494/ENSG00000228691/ENSG00000223957/ENSG00000109320
## GO:0042581 ENSG00000204386/ENSG00000227315/ENSG00000234343/ENSG00000234846/ENSG00000227129/ENSG00000184494/ENSG00000228691/ENSG00000223957/ENSG00000109320/ENSG00000059804/ENSG00000138772
## GO:0034774 ENSG00000163536/ENSG00000120885/ENSG00000103196/ENSG00000204386/ENSG00000227315/ENSG00000234343/ENSG00000234846/ENSG00000227129/ENSG00000184494/ENSG00000228691/ENSG00000223957/ENSG00000109320/ENSG00000109107
## GO:0060205 ENSG00000163536/ENSG00000120885/ENSG00000103196/ENSG00000204386/ENSG00000227315/ENSG00000234343/ENSG00000234846/ENSG00000227129/ENSG00000184494/ENSG00000228691/ENSG00000223957/ENSG00000109320/ENSG00000109107
## GO:0031983 ENSG00000163536/ENSG00000120885/ENSG00000103196/ENSG00000204386/ENSG00000227315/ENSG00000234343/ENSG00000234846/ENSG00000227129/ENSG00000184494/ENSG00000228691/ENSG00000223957/ENSG00000109320/ENSG00000109107
## GO:0043202 ENSG00000204386/ENSG00000227315/ENSG00000234343/ENSG00000234846/ENSG00000227129/ENSG00000184494/ENSG00000228691/ENSG00000223957
## Count
## GO:0035580 9
## GO:0042581 11
## GO:0034774 13
## GO:0060205 13
## GO:0031983 13
## GO:0043202 8
The results of the enrichment can be visualized in many different ways. We will now present some possible plots based on biological function:
# Dot Plot
dotplot(ego.bp, showCategory=12)
This plot is similar to a bar plot, except that the bars were replaced by points. The size of the points indicates how many genes there are in this category and the color gives information about the pValue. In contrast to the barplot, less genes per category are displayed here. Since we have defined a p and q value, these are only “significant” categories.We see that categories with fewer genes have a smaller pValue and have therefore been better assigned.
# Gene-Concept Network
cnetplot(ego.bp)
The Gene-Concept Network shows us in which categories a gene occurs. Also here the size of the points is proportional to the number of genes in the corresponding category.
# Enrichment Map for enrichment result of over-representation test or gene set enrichment analysis
emapplot(ego.bp)
The enrichment map connects the two plots seen befor. We see the related pathways and how many genes with which pValues are allocated in one category.
Since we believe that dot plots are the easiest to interpret and show the most important information at first glance, we will now only display the other two GO enrichment results with these.
# Dot Plot Molecular Function
dotplot(ego.mf, showCategory=12)
Here we see the diffrent molecular functions of our biomarkers. Like in the plots above, the result here is diffrent from the result from the Gene Ontology grouping. Nevertheless, these results are not very meaningful as the listed modifications are not assigned to specific molecules.Consequently, we cannot say anything about their effects.
# Dot Plot Cellular Component
dotplot(ego.cc, showCategory=12)
This plot shows the location of the biomarkers in the different cellular componets. Also here we see other categories than in the Barplot. But even like the polt before it, this does not give us any important information about the influence on cancer.
Reactome Pathway contains a big database of biological processes like signal transduction, DNA repair and metabolism. Thus it can be used as analysis tool for discovering functional relationships between genes from expression data (Fabregat et al., 2018). It returns enriched pathways from a given vector of genes, thereby also performing FDR control. The statistical part works similar to the one of Gene Ontology (GO) Analysis.
As a first step, we need to load the Reactome library:
library(ReactomePA)
Here, as well, we have to translate the gene ID of our biomarkers into the ENTREZID form:
gene.df <- bitr(generalbiomarkergenes, fromType = "SYMBOL",
toType = c("ENTREZID"),
OrgDb = org.Hs.eg.db)
x <- enrichPathway(gene=gene.df$ENTREZID,
pvalueCutoff = 0.05,
readable = T )
head(as.data.frame(x))
## ID
## R-HSA-2559586 R-HSA-2559586
## R-HSA-2559583 R-HSA-2559583
## R-HSA-2262752 R-HSA-2262752
## R-HSA-2559582 R-HSA-2559582
## R-HSA-2559584 R-HSA-2559584
## R-HSA-8852276 R-HSA-8852276
## Description
## R-HSA-2559586 DNA Damage/Telomere Stress Induced Senescence
## R-HSA-2559583 Cellular Senescence
## R-HSA-2262752 Cellular responses to stress
## R-HSA-2559582 Senescence-Associated Secretory Phenotype (SASP)
## R-HSA-2559584 Formation of Senescence-Associated Heterochromatin Foci (SAHF)
## R-HSA-8852276 The role of GTSE1 in G2/M progression after G2 checkpoint
## GeneRatio BgRatio pvalue p.adjust qvalue
## R-HSA-2559586 10/60 80/10554 1.948831e-11 9.140019e-09 6.482428e-09
## R-HSA-2559583 12/60 195/10554 7.268531e-10 1.704471e-07 1.208872e-07
## R-HSA-2262752 16/60 426/10554 1.096103e-09 1.713575e-07 1.215328e-07
## R-HSA-2559582 8/60 110/10554 1.757728e-07 2.060936e-05 1.461689e-05
## R-HSA-2559584 4/60 16/10554 1.632265e-06 1.531064e-04 1.085886e-04
## R-HSA-8852276 6/60 75/10554 3.883378e-06 3.035507e-04 2.152890e-04
## geneID
## R-HSA-2559586 HIST1H1C/HMGA2/HIST1H2BD/LMNB1/HIST2H2BE/HIST1H4H/CCNA1/TP53/HIST1H2AE/CDKN1A
## R-HSA-2559583 HIST1H1C/HMGA2/HIST1H2BD/LMNB1/HIST2H2BE/FOS/NFKB1/HIST1H4H/CCNA1/TP53/HIST1H2AE/CDKN1A
## R-HSA-2262752 HIST1H1C/TUBB2B/HMGA2/HIST1H2BD/LMNB1/HIST2H2BE/FOS/NFKB1/PSMB10/CITED2/HIST1H4H/CCNA1/TP53/TUBB4A/HIST1H2AE/CDKN1A
## R-HSA-2559582 HIST1H2BD/HIST2H2BE/FOS/NFKB1/HIST1H4H/CCNA1/HIST1H2AE/CDKN1A
## R-HSA-2559584 HIST1H1C/HMGA2/LMNB1/TP53
## R-HSA-8852276 TUBB2B/PSMB10/GTSE1/TP53/TUBB4A/CDKN1A
## Count
## R-HSA-2559586 10
## R-HSA-2559583 12
## R-HSA-2262752 16
## R-HSA-2559582 8
## R-HSA-2559584 4
## R-HSA-8852276 6
For visualization we can use the same type of plots as for GO Analysis.
barplot(x,showCategory=12)
The color of the bars is associated with the pValue. The smaller the value, the better the assignment. Morover, we see categories that are crealy associated with cancer growth like “TP53 Regulates Transcription of Genes Involved in G1 Cell Cycle Arrest”. Now it would be interesting to see if these genes are regulated up or down. This can be visualized in a cnet plot where the FC is stained.
# FC data for coloring
FC=TreatedVorinostat-UntreatedVorinostat
FC<-FC[generalbiomarkergenes,]
names(FC) <- gene.df$ENTREZID
cnetplot(x,categorySize="geneNum",foldChange = FC)
We see that there are both highly and down-regulated genes in the same pathways. Unfortunately, the plot does not show whether the whole pathway is activated or inactivated.
The whole enrichment analysis can be summarized in the enrichment map. However, due to the numerous connecting lines it is no longer clear and the Fold Change is also missing in this plot.
emapplot(x)
The next enrichment method we use is KEGG Pathway Analysis. KEGG stands for Kyoto Encyclopedia of Genes and Genomes. It is a freely accessible database that contains information on various drugs, metabolic pathways and genes, among other things (Kanehisa and Goto, 2000). The enrichment process is the same as for the other two methods presented here.
First, we need to load the libaries we will work with:
library(KEGG.db)
library(org.Hs.eg.db)
library(enrichplot)
library(DOSE)
The gene IDs must also be translated:
gene.df <- bitr(generalbiomarkergenes, fromType = "SYMBOL",
toType = c("ENTREZID"),
OrgDb = org.Hs.eg.db)
Now we can perform the enrichment. We did not choose a Adjust p.Value method because we could find more categories this way. When setting a specific pAdjustMethod, we would only be able to identify 5 categories.
kk <- enrichKEGG(gene=gene.df$ENTREZID,
pvalueCutoff = 0.05,
pAdjustMethod = "none")
head(summary(kk))
## ID Description GeneRatio
## hsa05202 hsa05202 Transcriptional misregulation in cancer 8/49
## hsa04210 hsa04210 Apoptosis 6/49
## hsa05203 hsa05203 Viral carcinogenesis 7/49
## hsa05166 hsa05166 Human T-cell leukemia virus 1 infection 7/49
## hsa05161 hsa05161 Hepatitis B 6/49
## hsa04115 hsa04115 p53 signaling pathway 4/49
## BgRatio pvalue p.adjust qvalue
## hsa05202 186/7867 1.652177e-05 1.652177e-05 0.002173917
## hsa04210 136/7867 1.814709e-04 1.814709e-04 0.009703879
## hsa05203 201/7867 2.212484e-04 2.212484e-04 0.009703879
## hsa05166 219/7867 3.737403e-04 3.737403e-04 0.012294090
## hsa05161 163/7867 4.827212e-04 4.827212e-04 0.012703188
## hsa04115 72/7867 1.001411e-03 1.001411e-03 0.017869939
## geneID Count
## hsa05202 8091/4790/330/4291/8900/7157/1026/4804 8
## hsa04210 4001/2353/4790/330/7157/637 6
## hsa05203 3017/8349/4790/8365/8900/7157/1026 7
## hsa05166 8061/2353/4790/8829/8900/7157/1026 7
## hsa05161 2353/4790/8900/7157/1026/637 6
## hsa04115 51512/7157/1026/637 4
For the visualization of the results, we use the same plots as before, except for an additional heatplot that is based on a heatmap.
par(mfrow=c(2,1))
barplot(kk,showCategory=12)
# FC data for coloring
FC=TreatedVorinostat-UntreatedVorinostat
FC<-FC[generalbiomarkergenes,]
names(FC) <- gene.df$ENTREZID
cnetplot(kk,categorySize="pvalue",foldChange=FC)
We see that the KEGG database has found pathways that can be directly linked to vorinostat, such as the category “Human T-cell leukemia virus 1 infection”. As already mentioned in the introduction of the board and specific analysis, vorinostat is used in leukaemia diseases. Thus, in comparison KEGG provides the best drug-specific enrichment results. These are summarized in the following heatplot. Here we can see the single genes on the x-axis and on the y-axis the pathways they are involved in marked by small boxes. Since boxes are stained with the Fold Change, we can also read here whether a gene has been down or upregulated.
heatplot(kk, foldChange=FC)
In addition, the KEGG data can be used to create so-called pathway graphs. For this the library “pathview” is needed. A pathway ID from the KEGG Pathway enrichment result is entered into the function and the graphic is created. This will be demonstrated in the following using the first listed pathway from the previous KEGG enrichment analysis.
Load the library:
library(pathview)
Load the pathway graph:
pathview(gene.data = FC,
pathway.id = "hsa05202",
species = "hsa",
limit = list(gene=5, cpd=1))
Here we see the pathway “Transcriptional misregulation in cancer - Homo sapiens (human)”. There are diffent flow charts with the corresponding genes for dirffent cancer types. With this information one could possibly make predictions in individual cancer therapy. If a patient’s gene expression is known, it can be compared with our biomarkers and a prediction can be made of the ways in which vorinostat might work. In the pathway graphs, individual genes and reaction pathways can be specifically observed.
Finally, we performed the Gene Set Variation Analysis, short GSVA. GSVA is an analysis method that exports gene expression profiles into pathway or siganture summary. Consequently, it allows easier interpretation of the data and helps to model the pathway activation (Haenzelmann et al., 2013).
First we perform the GSVA enrichment, resulting in an enrichment score for each gene and each sample:
# create data needed for GSVA Analysis
generalbiomarkergenes=as.list(generalbiomarkergenes)
FC=TreatedVorinostat-UntreatedVorinostat
library(GSVA)
g<- gsva(FC,
generalbiomarkergenes,
mx.diff=TRUE,
verbose=TRUE)
The enrichment results can be visualized by an heatmap :
heat <- t(scale(t(g)))
myCol <- colorRampPalette(c("dodgerblue", "black", "yellow"))(100)
myBreaks <- seq(-1.5, 1.5, length.out=101)
library(gplots)
png(filename = "GSVAheat.png")
GSVAheat= heatmap.2(heat,
col=myCol,
breaks=myBreaks,
main="GSVA Enrichment Score of Biomarker",
xlab="genes",
ylab="samples",
labRow="",
labCol="",
key=TRUE,
keysize=1.0,
key.title="",
key.xlab="Enrichment Z-score",
scale="none",
density.info="none",
reorderfun=function(d,w) reorder(d, w, agglo.FUN=mean),
trace="none",
cexRow=1.0,
cexCol=1.0,
distfun=function(x) dist(x, method="euclidean"),
hclustfun=function(x) hclust(x, method="ward.D2"))
The enrichement score summarizes the gene expression of each single genes and is given als the maximum derivation of zero. In our heatmap we can see two big gene groups looking on the left side, which also can be found in the dendogram of the heatmap.
Since we have a mapping of a gene and a sample, we see that the gsva enrichment did not work properly, because we were supposed to get groups of genes in a particular pathway. Such a heatmap with pathways would look like the following. It was created by Haenzelmann et al. with leukemia data.
We can see the diffrent samples grouped by pathways. The pictures shows the diffrentially activated pathways in the Leukemia data set.
Finally, looking at all enrichment methods, it can be said that KEGG is best suited for our purposes. On the one hand it provides cancer- and drug-specific pathway information and on the other hand it can be used to create pathway graphs that could be used for predictions in individual therapy. Gene Ontology Analysis provides only very superficial enrichment results and is more suitable for sorting genes by broad functions. It has been shown how important the inclusion of statistical parameters is. Without these, one often gets classifications that apply to all genes and which are not specific. The Reactome Pathway Analysis had provided us with more specific analysis results, including cancer-specific metabolic pathways. However, in contrast to KEGG, the information about the progression of the pathways and the more precise function of individual genes was missing. Since the GSVA enrichment did not work properly with us, no statement can be made about this method.
Fabregat, A., Jupe, S., Matthews, L., Sidiropoulos, K., Gillespie, M., Garapati, P., Haw, R., Jassal, B., Korninger, F., May, B., et al. (2018). The Reactome Pathway Knowledgebase. Nucleic Acids Res 46, D649-d655.
Haenzelmann, S., Castelo, R., and Guinney, J. (2013). GSVA: gene set variation analysis for microarray and RNA-Seq data. BMC Bioinformatics 14, 7.
Kanehisa, M., and Goto, S. (2000). KEGG: kyoto encyclopedia of genes and genomes. Nucleic acids research 28, 27-30.
Yon Rhee, S., Wood, V., Dolinski, K., and Draghici, S. (2008). Use and misuse of the gene ontology annotations. Nature Reviews Genetics 9, 509.